Welcome to our final project! We are Macalester College students (class of 2021/2022) from the Department of Mathematics, Statistics, and Computer Science. We took Advanced Data Science in R (STAT 494) during the spring 2021 semester. Below is our final project for this course.

Introduction

Computer science is a rapidly growing field in the United States and around the world. Industry is constantly releasing new advancements, and technology is becoming more ingrained in our daily lives. The increasing demand for computer scientists has caused the occupation to grow in popularity. To meet the demand, educational institutions and systems are increasing the number of courses offered in order to train more future computer scientists. This development started at the college level, where majoring in computer science is becoming a widely available option. At Macalester College, it is one of the largest departments in terms of both students and faculty.

While the availability of courses at the college level is an amazing start, there is a big push to offer computer science courses in K-12 education. Offering computer science courses in elementary and secondary schools gives kids an opportunity to be exposed to coding. This could lead younger students to discover new interests and engage with computer science earlier. Often, being exposed to computer science at a younger age can make students more comfortable with the material and the field later on. This can lead to a more empowered and diverse set of students entering the workforce or higher education.

Given the importance of having computer science courses available in K-12 education, we decided to explore the availability of these courses in K-12 school districts in Minnesota. For our Advanced Data Science final project, we explore the connections among a variety of datasets related to this topic, including K-12 computer science course availability in Minnesota, demographic information from the U.S. census, ACT scores, and funding.

## Reading layer `mn_acs_ss_act_pred' from data source `/Users/anaelkuperwajs/STAT 494/STAT494-Final-Project/Website/mn_acs_ss_act_pred/mn_acs_ss_act_pred.shp' using driver `ESRI Shapefile'
## Simple feature collection with 324 features and 94 fields (with 1 geometry empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -97.23921 ymin: 43.49936 xmax: -89.49174 ymax: 49.38436
## Geodetic CRS:  NAD83

Computer Science K-12 Course Availability in Minnesota

To begin, let’s explore the computer science courses currently available in K-12 education in Minnesota. The two plots below show the state’s public school districts along with the number and variety of computer science courses offered in each district.

Demographics, ACT Scores, and Funding

Because public schools are funded by property taxes, course availability is usually an intersectional issue that depends on other factors. We hypothesized that course availability would be correlated with each district’s overall wealth and access to resources. In this section, we explore some of the variables we expected to be significantly related to course availability.

Connections

Now that we have introduced our datasets, we will show how they connect and examine whether course availability is correlated with demographic variables, ACT scores, and funding.

Modeling Computer Science K-12 Course Availability in Minnesota

To understand which factors have the largest influence on course availability, we created two models to predict the number of computer science courses per district in Minnesota. The first model was LASSO, a linear regression method that shrinks coefficients, some all the way to zero, which removes uninformative variables from the model. With over 80 possible predictors, it would be difficult to select variables manually for ordinary least squares, and including everything would lead to overfitting. The second model was a random forest, which consists of a large number of decision trees and averages the predictions of those trees.
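To illustrate how LASSO can shrink coefficients to exactly zero, here is a minimal base R sketch using cyclic coordinate descent with soft-thresholding. It is not the code used for this project: the simulated data, the function names `soft_threshold` and `lasso_cd`, and the penalty value `lambda = 0.5` are all assumptions made for illustration, and in practice a dedicated modeling package would normally handle the fitting and the choice of penalty.

```r
# Soft-thresholding: shrinks z toward zero by gamma and returns exactly
# zero whenever |z| <= gamma. This is what lets LASSO drop variables.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Minimal LASSO fit by cyclic coordinate descent (illustrative only).
# Minimizes (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# after centering y and standardizing the columns of X.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  X <- scale(X) * sqrt(n / (n - 1))  # each column: mean 0, mean square 1
  y <- y - mean(y)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Partial residual: remove the fit of every predictor except j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft_threshold(sum(X[, j] * r_j) / n, lambda)
    }
  }
  beta
}

# Hypothetical data: 10 predictors, only the first two affect the outcome
set.seed(494)
n <- 200
X <- matrix(rnorm(n * 10), nrow = n, ncol = 10)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

beta_hat <- lasso_cd(X, y, lambda = 0.5)
print(round(beta_hat, 2))
```

In this simulated setting, the coefficients of the eight noise predictors are set to zero while the two real predictors keep nonzero (but shrunken) coefficients. A larger `lambda` removes more variables; a smaller one keeps more.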

Before fitting the models, the main transformation we performed was a log transformation of many of the variables from the Annual Survey of School System Finances. These variables were raw totals of revenue or expenditure, so they were right-skewed, with a few districts having much higher values than the majority. Based on the RMSE, the random forest greatly outperformed the LASSO, with an RMSE of approximately 1.86 compared to the LASSO’s 4.11.
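The two ideas in this paragraph, log-transforming right-skewed totals and comparing models by RMSE, can be sketched in base R as follows. The `revenue` values are simulated, and the helper names `skewness` and `rmse` are hypothetical; this is not the project's data or code.

```r
# Root mean squared error: typical size of a prediction error,
# in the same units as the outcome (here, number of courses).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Simple moment-based skewness: near 0 for symmetric data,
# clearly positive for right-skewed data.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical right-skewed revenue totals for 300 districts
set.seed(494)
revenue <- rlnorm(300, meanlog = 15, sdlog = 1)
log_revenue <- log(revenue)

cat("Skewness before log:", round(skewness(revenue), 2), "\n")
cat("Skewness after log: ", round(skewness(log_revenue), 2), "\n")

# RMSE on a tiny made-up example: errors of 0, 0, and 2 courses
example_rmse <- rmse(c(1, 2, 3), c(1, 2, 5))
cat("Example RMSE:", round(example_rmse, 3), "\n")
```

If a variable contains zeros, `log()` returns `-Inf`; `log1p()` (the log of one plus the value) is a common alternative in that case, though whether that applies to these variables is an assumption.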

In a random forest model, some variables will have higher predictive power and contribute more to the outcome. Below is a plot ranking our predictors in terms of their importance:

Each bar shows how much the RMSE would change if the corresponding variable were permuted. If permuting a variable increases the RMSE much more than permuting other variables does, that variable is important. Here, the RMSE increases the most when revenue from the Child Nutrition Act, spending on instructional staff, and total expenditure are permuted. The highest-ranking variables all came from the School Survey. The three most important demographic variables from the ACS are the percent of the total population who are Black alone, the percent of households with an Internet subscription, and the percent of households receiving SSI, public assistance, or food stamps (in each district). The variables at the bottom, which show no change in RMSE when permuted, were excluded from the modeling from the beginning because they are ID or raw demographic variables (for the latter, we used their percentage versions).
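The permutation idea described above can be sketched in base R. To stay self-contained, this sketch swaps in a linear model (`lm`) for the random forest and measures RMSE on a held-out set; the simulated data, the variable names `x1` to `x3`, and the helper `perm_importance` are all assumptions for illustration, not the project's actual setup.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical data: x1 matters a lot, x2 a little, x3 not at all
set.seed(494)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 4 * d$x1 + 1 * d$x2 + rnorm(n)

train <- d[1:200, ]
test <- d[201:300, ]

# Stand-in model; permutation importance works with any fitted model
fit <- lm(y ~ x1 + x2 + x3, data = train)
baseline <- rmse(test$y, predict(fit, newdata = test))

# Average increase in RMSE when one variable's values are shuffled,
# which breaks its relationship with the outcome.
perm_importance <- function(var, n_rep = 20) {
  increases <- replicate(n_rep, {
    shuffled <- test
    shuffled[[var]] <- sample(shuffled[[var]])
    rmse(test$y, predict(fit, newdata = shuffled)) - baseline
  })
  mean(increases)
}

importance <- sapply(c("x1", "x2", "x3"), perm_importance)
print(sort(importance, decreasing = TRUE))
```

A variable the model relies on heavily produces a large RMSE increase when shuffled, while a variable the model ignores produces an increase close to zero, which matches how the bars in the plot are read.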

Implications

TODO: writeups

For more information about how we created this project, please visit:

GitHub: https://github.com/anaelkuperwajs/STAT494-Final-Project

Behind the scenes: https://github.com/anaelkuperwajs/STAT494-Final-Project/blob/main/behind_the_scenes.Rmd